Data-driven Language Independent Word Segmentation Using Character-Level Information
Authors
Abstract
This paper presents a data-driven, language-independent word segmentation system that was trained on Chinese corpus data for the second Chinese word segmentation bakeoff. The system consists of a base segmentation algorithm and refining procedures for the character sequences that the base step leaves undecided. It uses no lexicon: base segmentation relies only on character bigrams, and an HMM is applied to the remaining character sequences. As a final step, high-frequency character trigrams are used to correct the error-prone parts of the text.
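The abstract gives no detail on how the character-bigram step decides boundaries, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that each adjacent character pair is labeled by how often a word boundary fell between its two characters in segmented training data, and that pairs without a confident majority are left undecided for a later refining step (the HMM, in the paper). The function names, the `threshold` parameter, and the toy Latin-letter data are all hypothetical.

```python
from collections import Counter


def train_bigram_counts(segmented_sentences):
    """Count, for each adjacent character pair, how often a word boundary
    falls between the two characters (split) and how often it does not (joined)."""
    split, joined = Counter(), Counter()
    for words in segmented_sentences:
        chars, boundary_after = [], []
        for word in words:
            for i, ch in enumerate(word):
                chars.append(ch)
                # True when this character is the last one of its word.
                boundary_after.append(i == len(word) - 1)
        for i in range(len(chars) - 1):
            pair = chars[i] + chars[i + 1]
            if boundary_after[i]:
                split[pair] += 1
            else:
                joined[pair] += 1
    return split, joined


def base_segment(text, split, joined, threshold=0.9):
    """Label each gap between characters as 'B' (boundary), 'J' (join),
    or '?' (undecided, to be resolved by a refining step).
    The threshold value is an assumption, not taken from the paper."""
    labels = []
    for i in range(len(text) - 1):
        pair = text[i:i + 2]
        s, j = split[pair], joined[pair]
        total = s + j
        if total and s / total >= threshold:
            labels.append("B")
        elif total and j / total >= threshold:
            labels.append("J")
        else:
            labels.append("?")  # unseen or ambiguous pair
    return labels


# Toy training data: each sentence is a list of already-segmented words.
training = [["ab", "c"], ["ab", "d"], ["c", "ab"]]
split_counts, joined_counts = train_bigram_counts(training)
print(base_segment("abc", split_counts, joined_counts))
print(base_segment("ad", split_counts, joined_counts))
```

No lexicon is consulted anywhere in this sketch, which matches the abstract's claim; the gaps marked `?` correspond to the undecided character sequences that the paper hands to its refining procedures.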
Similar resources
Challenging Language-Dependent Segmentation for Arabic: An Application to Machine Translation and Part-of-Speech Tagging
Word segmentation plays a pivotal role in improving any Arabic NLP application. Therefore, considerable research effort has gone into improving its accuracy. Off-the-shelf tools, however, are: i) complicated to use and ii) domain/dialect dependent. We explore three language-independent alternatives to morphological segmentation using: i) data-driven sub-word units, ii) characters as a unit of learning...
Waterloo at NTCIR-3: Using Self-supervised Word Segmentation
In this paper, we describe the system we used in the NTCIR-3 CLIR (cross-language IR) task. We participated in the SLIR (single-language IR) track. Our system uses a self-supervised word segmentation technique for Chinese information retrieval, which combines the advantages of traditional dictionary-based approaches with those of character-based approaches, while overcoming many of their shortcomings. ...
Chinese and Japanese Word Segmentation Using Word-Level and Character-Level Information
In this paper, we present a hybrid method for Chinese and Japanese word segmentation. Word-level information is useful for analyzing known words, while character-level information is useful for analyzing unknown words; the method uses both types of information to handle known and unknown words effectively. Experimental results show that this method achieves high o...
Text segmentation with character-level text embeddings
Learning word representations has recently seen much success in computational linguistics. However, assuming sequences of word tokens as input to linguistic analysis is often unjustified. For many languages word segmentation is a non-trivial task and naturally occurring text is sometimes a mixture of natural language strings and other character data. We propose to learn text representations dir...
Word Segmentation for Urdu OCR System
This paper presents a technique for word segmentation in an Urdu OCR system. Word segmentation, or word tokenization, is a preliminary task for understanding the meaning of sentences in Urdu language processing. Several techniques are available for word segmentation in other languages, but little work has been done on word segmentation for Urdu Optical Character Recognition (OCR) systems. A me...
Publication date: 2005